[Attention] Support MTP with DCP #24997
Conversation
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Code Review
This pull request aims to enable Multi-Token Prediction (MTP) with decode context parallelism (DCP) by supporting query_len > 1 in the FlashAttention MLA backend. The changes remove the previous restrictions and add metadata for a custom causal mask.
I've found a critical issue where the new logic to compute query_base_positions in MLACommonMetadataBuilder is not being used by FlashAttnMLAMetadataBuilder because it overrides the _build_decode method. This will prevent the feature from working as intended. Please see my detailed comment.
# Compute DCP query base positions if using DCP
query_base_positions = None
if self.dcp_world_size > 1:
    query_lens = query_start_loc_cpu[1:] - query_start_loc_cpu[:-1]
    query_base_positions = (seq_lens_cpu - query_lens).to(
        seq_lens_device.device)

return MLACommonDecodeMetadata(
    block_table=block_table_tensor,
    seq_lens=seq_lens_device,
    query_base_positions=query_base_positions,
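In other words, query_base_positions records, for each decode request, the absolute position of its first query token (seq_len − query_len); with query_len > 1 under DCP, this offset is presumably what lets the backend apply the custom causal mask mentioned above against the sharded KV cache.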
This change correctly computes query_base_positions for DCP. However, FlashAttnMLAMetadataBuilder in vllm/v1/attention/backends/mla/flashattn_mla.py overrides _build_decode and does not call this base implementation. As a result, query_base_positions will be None for the FlashAttention MLA backend, and the MTP with context parallelism feature will not work correctly.
To fix this, either move this logic into FlashAttnMLAMetadataBuilder._build_decode or refactor it so that FlashAttnMLAMetadataBuilder can reuse it — for example, by computing query_base_positions in FlashAttnMLAMetadataBuilder._build_decode and passing it to the FlashAttnMLADecodeMetadata constructor, as sketched below.
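A minimal sketch of that suggestion, assuming a simplified _build_decode signature; only the query_base_positions computation and the field names come from the snippet quoted above, the rest (argument list, other metadata fields) is hypothetical:

```python
# Hypothetical sketch inside FlashAttnMLAMetadataBuilder; the exact
# _build_decode signature and the other FlashAttnMLADecodeMetadata
# fields are assumptions, not the actual vLLM code.
def _build_decode(self, block_table_tensor, seq_lens_cpu, seq_lens_device,
                  query_start_loc_cpu):
    query_base_positions = None
    if self.dcp_world_size > 1:
        # Same computation as the base class: the first query token of each
        # request sits at absolute position seq_len - query_len.
        query_lens = query_start_loc_cpu[1:] - query_start_loc_cpu[:-1]
        query_base_positions = (seq_lens_cpu - query_lens).to(
            seq_lens_device.device)

    return FlashAttnMLADecodeMetadata(
        block_table=block_table_tensor,
        seq_lens=seq_lens_device,
        # ... plus the FlashAttn-specific fields the existing override builds ...
        query_base_positions=query_base_positions,
    )
```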
This pull request has merge conflicts that must be resolved before it can be merged.

Superseded by #25049.
Purpose
#24453 added DCP support but did not support query_len > 1. This PR, which depends on a corresponding FlashAttention PR (vllm-project/flash-attention#92), implements a custom causal mask to take advantage of the FlashAttention MLA backend's support for query_len > 1, thereby enabling MTP.
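The rough idea behind such a mask can be illustrated as follows. This is a standalone sketch, not the actual kernel change; it assumes the KV cache is interleaved across DCP ranks token by token, and the function name and arguments are hypothetical:

```python
import torch

def dcp_causal_mask(query_base: int, query_len: int,
                    num_local_kv: int, dcp_rank: int,
                    dcp_world_size: int) -> torch.Tensor:
    # Absolute positions of the query tokens; query_base = seq_len - query_len.
    q_pos = query_base + torch.arange(query_len)                      # [Q]
    # Absolute positions of the KV tokens stored on this DCP rank,
    # assuming token i of the sequence lives on rank i % dcp_world_size.
    kv_pos = torch.arange(num_local_kv) * dcp_world_size + dcp_rank   # [K]
    # Standard causal rule, applied in absolute-position space.
    return kv_pos[None, :] <= q_pos[:, None]                          # [Q, K] bool
```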
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.